A Causal Bayesian Network View of Reinforcement Learning

Authors

  • Charles W. Fox
  • Neil Girdhar
  • Kevin N. Gurney
Abstract

Reinforcement Learning (RL) is a heuristic method for learning locally optimal policies in Markov Decision Processes (MDPs). Its classical formulation (Sutton & Barto 1998) maintains point estimates of the expected values of states or state-action pairs. Bayesian RL (Dearden, Friedman, & Russell 1998) extends this to beliefs over values. However, the concept of values sits uneasily with the original notion of Bayesian Networks (BNs), which were defined (Pearl 1988) as having explicitly causal semantics. In this paper we show how Bayesian RL can be cast in an explicitly Bayesian Network formalism, making use of backwards-in-time causality. We show how the heuristic used by RL can be seen as an instance of a more general BN inference heuristic, which cuts causal links in the network and replaces them with non-causal approximate hashing links for speed. This view brings RL into line with standard Bayesian AI concepts, and suggests similar hashing heuristics for other general inference tasks.

Introduction

Reinforcement Learning

An MDP is a tuple $(S, A, p_s, p_r)$ where $s \in S$ are states, $a \in A$ are actions, $p_s(s'|s,a)$ are transition probabilities and $p_r(r|s,a)$ are reward probabilities. The goal is to select a sequence of actions $\{a_t\}$ (a plan) over time $t$ to maximise the expected value $\langle v \rangle = \langle \sum_{t=1}^{T} \gamma^t r_t \rangle$, where $T$ may be infinite, and each action is selected as a function of the current, observable state, $a_t = a_t(s_t)$. We consider the case where $p_s$ and $p_r$ are unknown.

Classical Reinforcement Learning approximates the solution using some parametric, point-estimate function $\hat{v}(s,a;\theta)$ and seeks $\hat{\theta}$ to best approximate

$$\hat{v}(s_t, a_t; \hat{\theta}) \approx \max_{a_{t+1:T}} \left\langle \sum_{\tau=t}^{T} \gamma^\tau r_\tau \right\rangle = \left\langle r_t + \gamma \max_{a_{t+1}} \hat{v}(s_{t+1}, a_{t+1}; \theta) \right\rangle.$$

It runs by choosing $a_t$ at each step (which may be $a = \arg\max_a \hat{v}(s,a)$ if best available performance is required, or randomised for ad-hoc exploratory learning), then observing the resulting $r_t$ and $s_{t+1}$ and updating $\theta$ towards a minimised error value (with $w_0 + w_1 = 1$; a tabular sketch of this update appears at the end of this section):

$$\theta \leftarrow w_0 \theta + w_1 \arg\min_{\theta'} \left( \hat{v}(s_t, a_t; \theta') - \left[ r_t + \gamma \max_{a_{t+1}} \hat{v}(s_{t+1}, a_{t+1}; \theta) \right] \right)^2$$

Bayesian RL uses a larger parameter set $\phi$ to parametrise and learn a full belief over values, $Q(v|s,a;\hat{\phi}) \approx P(v|s,a)$. (Classical RL is thus the special case where this parametric probability function is assumed to be a Dirac delta function, $Q(v|s,a;\phi) = \delta(v; \hat{v}(s,a;\theta))$.)

Causal Bayesian Networks

A Directed Graphical Model (DGM) is a set of variables $\{X_i\}$ with directed links specified by parent functions $\{\mathrm{pa}(X_i)\}$, and a set of conditional probabilities $\{P_i(X_i|\mathrm{pa}(X_i))\}$, so that the joint is $P(\{X_i\}) = \prod_i P(X_i|\mathrm{pa}(X_i))$. A Causal Bayesian Network (CBN) is a DGM together with a set of operators $\mathrm{do}(X_i = x_i)$ which, when applied to the model, set $\mathrm{pa}(X_i) = \emptyset$ and $P(X_i) = \delta(X_i; x_i)$. The do operators correspond (Pearl 2000) to the effects of performing an intervention on the system being modelled. A DGM of a system 'respects causal semantics' if its corresponding CBN faithfully models interventions. (The name 'Bayesian Networks' originally referred (Pearl 1988) to CBNs.) While DGMs and CBNs are generally treated as complementary, we will show how a hybrid net with some causal and some acausal links is a useful way to think about Reinforcement Learning algorithms, and suggests generalisations for creating other approximate inference algorithms.
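As a concrete illustration of the classical update above, here is a minimal sketch assuming a tabular $\hat{v}$, so that $\theta$ is just a table of Q-values and the blend $w_0\theta + w_1\arg\min_{\theta'}(\cdot)$ reduces to a temporal-difference step with learning rate $w_1$. The two-state MDP and all names in the code are illustrative assumptions, not anything specified in the paper.

```python
import random

# Tabular sketch of the update rule: theta is a table of point
# estimates Q[s][a], and the blend w0*theta + w1*argmin(...) reduces
# to a TD step with learning rate w1 (w0 + w1 = 1).
# The toy MDP below is illustrative, not from the paper.

GAMMA = 0.9   # discount factor gamma
W1 = 0.1      # learning rate w1; w0 = 1 - W1

STATES = [0, 1]
ACTIONS = [0, 1]

def step(s, a):
    """Hypothetical unknown dynamics p_s, p_r: returns (next state, reward)."""
    s_next = random.choice(STATES)
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return s_next, r

Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

s = 0
for t in range(10000):
    # Randomised action selection for exploratory learning (use
    # argmax over Q[s] if best available performance is required).
    a = random.choice(ACTIONS)
    s_next, r = step(s, a)
    # Target r_t + gamma * max_{a'} Q(s_{t+1}, a').
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] = (1 - W1) * Q[s][a] + W1 * target
    s = s_next

print(Q)
```

With a tabular parametrisation the argmin over the squared error is achieved exactly at the target $r_t + \gamma\max_{a'} Q(s_{t+1}, a')$, which is why the blend collapses to the familiar Q-learning rule.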
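For the Bayesian extension, one possible sketch keeps $\phi$ as a (mean, variance) pair per state-action pair and conditions it on each observed TD target with a standard Gaussian conjugate update. The Gaussian family and the noise variance OBS_VAR are assumptions made here for illustration; this is not the parametrisation used by Dearden, Friedman, & Russell (1998).

```python
# Sketch of Bayesian RL: keep a belief Q(v | s, a; phi) rather than a
# point estimate. Here phi = (mean, var) per (state, action) of a toy
# two-state MDP, updated by a Gaussian conjugate step per TD target.

OBS_VAR = 1.0  # assumed noise variance on each observed TD target

phi = {(s, a): (0.0, 10.0) for s in (0, 1) for a in (0, 1)}

def update_belief(s, a, target):
    """Condition the Gaussian belief over v(s, a) on one TD target."""
    mean, var = phi[(s, a)]
    precision = 1.0 / var + 1.0 / OBS_VAR            # posterior precision
    post_mean = (mean / var + target / OBS_VAR) / precision
    phi[(s, a)] = (post_mean, 1.0 / precision)

update_belief(0, 1, target=0.7)
print(phi[(0, 1)])
# As var -> 0 the belief collapses to a Dirac delta at the mean,
# recovering classical RL as the special case noted above.
```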
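Finally, a toy discrete CBN showing the do operator as defined above: the joint factorises as $\prod_i P(X_i|\mathrm{pa}(X_i))$, and $\mathrm{do}(X_i = x_i)$ cuts the parent links and replaces $P_i$ with a delta at $x_i$. The two-variable network and all names are illustrative.

```python
# Toy binary CBN with network A -> B: pa(A) = (), pa(B) = (A,).
parents = {"A": (), "B": ("A",)}
cpts = {
    "A": {(): {0: 0.7, 1: 0.3}},                 # P(A)
    "B": {(0,): {0: 0.9, 1: 0.1},                # P(B | A=0)
          (1,): {0: 0.2, 1: 0.8}},               # P(B | A=1)
}

def joint(assignment, parents, cpts):
    """P({X_i}) = product over i of P(X_i | pa(X_i))."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpts[var][pa_vals][assignment[var]]
    return p

def do(var, value, parents, cpts):
    """Intervention do(var = value): set pa(var) = {} and P(var) to a delta."""
    new_parents = dict(parents)
    new_parents[var] = ()
    new_cpts = dict(cpts)
    new_cpts[var] = {(): {v: (1.0 if v == value else 0.0) for v in (0, 1)}}
    return new_parents, new_cpts

# P(A=1 | do(B=1)) stays at the prior 0.3: cutting B's incoming link
# means intervening on the effect tells us nothing about the cause.
pa2, cpt2 = do("B", 1, parents, cpts)
num = joint({"A": 1, "B": 1}, pa2, cpt2)
den = sum(joint({"A": a, "B": 1}, pa2, cpt2) for a in (0, 1))
print(num / den)  # 0.3
```

Contrast with ordinary conditioning: merely observing $B = 1$ would raise the posterior on $A = 1$ above its prior, whereas intervening leaves it unchanged, which is the distinction between causal and acausal links that the paper builds on.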




Publication date: 2008